AITopics | document retrieval

Collaborating Authors

document retrieval

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ac112e8ffc4e5b9ece32070440a8ca43-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 12:07:56 GMT

information retrieval, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Asia > China (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

A Neural Corpus Indexer for Document Retrieval

Neural Information Processing SystemsDec-24-2025, 22:07:29 GMT

Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, where the index is hard to be directly optimized for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture, and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrated the superiority of NCI on two commonly used academic benchmarks, achieving +21.4% and +16.8% relative enhancement for Recall@1 on NQ320k dataset and R-Precision on TriviaQA dataset, respectively, compared to the best baseline method.

document retrieval, name change, neural corpus indexer, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

Add feedback

M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Kolavi, Adithya S, Jain, Vyoman

arXiv.org Artificial IntelligenceDec-4-2025

Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2512.03514

Country:

Asia (0.67)
North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Add feedback

QBR: A Question-Bank-Based Approach to Fine-Grained Legal Knowledge Retrieval for the General Public

Yuan, Mingruo, Kao, Ben, Wu, Tien-Hsuan

arXiv.org Artificial IntelligenceNov-21-2025

Retrieval of legal knowledge by the general public is a challenging problem due to the technicality of the professional knowledge and the lack of fundamental understanding by laypersons on the subject. Traditional information retrieval techniques assume that users are capable of formulating succinct and precise queries for effective document retrieval. In practice, however, the wide gap between the highly technical contents and untrained users makes legal knowledge retrieval very difficult. We propose a methodology, called QBR, which employs a Questions Bank (QB) as an effective medium for bridging the knowledge gap. We show how the QB is used to derive training samples to enhance the embedding of knowledge units within documents, which leads to effective fine-grained knowledge retrieval. We discuss and evaluate through experiments various advantages of QBR over traditional methods. These include more accurate, efficient, and explainable document retrieval, better comprehension of retrieval results, and highly effective fine-grained knowledge retrieval. We also present some case studies and show that QBR achieves social impact by assisting citizens to resolve everyday legal concerns.

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.24963/ijcai.2025/1109

2505.04883

Country:

Europe (0.46)
North America > United States (0.28)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Health & Medicine > Therapeutic Area (0.46)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)

Add feedback

Attention Grounded Enhancement for Visual Document Retrieval

Cui, Wanqing, Huang, Wei, Guo, Yazhi, Hu, Yibo, Jin, Meiguang, Ma, Junfeng, Bi, Keping

arXiv.org Artificial IntelligenceNov-18-2025

Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To alleviate this problem, we propose a \textbf{A}ttention-\textbf{G}rounded \textbf{RE}triever \textbf{E}nhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines local signals with the global signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://anonymous.4open.science/r/AGREE-2025.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.13415

Country:

Asia (0.47)
North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Masry, Ahmed, Thakkar, Megh, Bechard, Patrice, Madhusudhan, Sathwik Tejaswi, Awal, Rabiul, Mishra, Shambhavi, Suresh, Akshay Kalkunte, Daruru, Srivatsava, Hoque, Enamul, Gella, Spandana, Scholak, Torsten, Rajeswar, Sai

arXiv.org Artificial IntelligenceNov-4-2025

Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains 3.61% improvements over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.00903

Country: Europe > Switzerland (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.36)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.35)

Add feedback

B.2 Hierarchical k -means for semantic identifier The pseudo code of hierarchical k -means is detailed in in Algorithm 1. Algorithm 1: Hierarchical k -means.

Add feedback

Model-enhanced Vector Index

Neural Information Processing SystemsOct-9-2025, 04:26:23 GMT

The document retrieval stage retrieves candidate documents related to the query, while the ranking stage re-ranks the documents using precise ranking scores.

information retrieval, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Asia > China (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

Barati, Iman, Amiri, Mostafa, Faili, Heshaam

arXiv.org Artificial IntelligenceSep-16-2025

Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high quality instruction datasets for SFT. Our approach begins with a limited set of domain specific, human generated questions, which are systematically expanded using a large language model. Subsequently, domain relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction response pairs, and the source code in a publicly accessible Git repository: [https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct)

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2509.10708

Country: